Skip to main content

Getting Started

This page provides instructions for setting up and running the GovDoc Scanner project.

Prerequisites

  • Node.js: v18.x or newer (recommended: v20.x)
  • Git: For cloning the repository
  • Gemini API Key: Required for AI-powered document processing (Get one here)

Quick Start

1. Clone and Install

git clone https://github.com/flexivian/govdoc-scanner.git
cd govdoc-scanner
npm install

2. Environment Setup

cp .env.example .env
# Edit .env and add your Gemini API key:
# GEMINI_API_KEY=your_gemini_api_key_here

3. First Run

Test with the orchestrator (recommended):

npx nx run orchestrator:start
# Select: 1) Single ID
# Enter GEMI ID: 123204604000
# Watch the automated process

Project Applications

The project includes three main applications:

  • Orchestrator: Complete end-to-end workflow (recommended)
  • Crawler: Search and download documents from GEMI portal
  • Doc-Scanner: Process documents and extract metadata

Individual Application Usage

npx nx run orchestrator:start
# Combines crawler + doc-scanner automatically
# Shows progress bars and summaries
# Output saved to project root ./output/

Crawler

npx nx run crawler:start
# Search for companies or download by GEMI ID
# Results saved to apps/crawler/src/downloads/

Doc-Scanner

npx nx run doc-scanner:start
# Process documents from input directory
# Requires manual document placement in apps/doc-scanner/src/data/input/

Output Structure

After processing, find results in:

output/
├── 123204604000/
│ ├── 123204604000_final_metadata.json
│ └── document_downloads/
│ └── *.pdf, *.docx files
└── govdoc-output.json # Summary

Next Steps

Troubleshooting

  • API Key Issues: Ensure valid Gemini API key in .env file
  • Browser Issues: Run npx playwright install chromium
  • Permissions: Check write access to output directories